Authors: Francesco Giardina, Yunke Peng, Benjamin Stocker
The motivations for practicing open science include the following:
When returning to a project after a long absence, it is easy to lose track of the current version of your code and data. Good data management and code version control prevent this: you can review any change or comment made at any point in the past, and GitHub records every change you make to your code.
Well-managed data and code also pay off when a publication is under revision. Imagine that reviewers ask you to check additional variables that exist only in your original dataset but not in the final aggregated dataset. If your code is managed well, it is very efficient to collect, trace, and combine them in your further analysis.
It also helps when working with collaborators: code and data can be shared efficiently and merged without worrying about conflicts.
Using an RStudio project as a repository ensures that all analyses belonging to one paper are well organized and easy to check.
Good data management (e.g. between Euler, local repositories, and public data) also speeds up data analysis and keeps all work well organized.
Tracking code with git also helps the whole team work efficiently when debugging or updating to a newer version together.
The importance of good data management transcends the success of a scientific project itself: it guarantees the reproducibility of results, thus encouraging the scientific community to reuse a workflow - a key concept for the advancement of Science.
After an attempt to standardize the practice of open science in 2007, it was not until 2016 that representatives of academia, industry, funding agencies and publishers defined and published a set of principles that would become known as the FAIR Data Principles. Later that year, at the 2016 G20 Hangzhou summit, the G20 endorsed the application of the FAIR principles to research. In short, the FAIRification process consists in data being Findable, Accessible, Interoperable, and Reusable.
Open Science refers to the effort to make scientific research available to peers in academia and to the public, to foster the dissemination of findings and the development of knowledge. It involves a methodical change to the research cycle, promoting more exchange and collaboration. The shift to an open science practice is not trivial, and often encompasses a cultural change, but it will be essential to strengthen the link between scientific research and society. Contributing towards this effort, the European Union is leading the way with several initiatives:
The transition to open science over the past few years has been motivated by a “reproducibility crisis”, i.e. the failure to replicate scientific results. Recent evidence suggests that less than half of scientific findings may be reliable. XXX reference XXX. Practicing open science offers a solution to this problem, as sharing materials and data eases the replication of original studies by other scientists. Open science is made possible by open data, i.e. when the data itself is freely accessible to other peers and the public. Policies enabling open data allow researchers to innovate, starting from existing knowledge. Other essential benefits for scientists include reputational gains, increased visibility and impact, and a broader increase in the reliability of research, as it enables the replication and verification of scientific results. This in turn boosts citizens’ trust in science.
The revolution of open science is not limited to the academic world alone. Society is a consumer of science - from the computer you’re reading this article on, to the vaccines that are taking the world out of the current pandemic.
Another key advantage of scientific results being accessible without a fee is that anyone can benefit from them, regardless of their location or economic situation. This is especially important for the scientific community in developing countries. In the end, open science speeds up the circulation of new information and helps generate solutions to the current great challenges. As Isaac Newton said in his 1675 letter: “If I have seen further, it is by standing on the shoulders of Giants.”
Last but not least, by making your code, data, and methods available, you are creating more citable items for every published project. This will grant more visibility to your research, as well as possibly increasing the total number of times your work gets referenced. It can trigger new collaborations and strengthen the impact of your research.
Obstacles to the development of open science trace back to when journals were still in paper format, and storing and sharing materials and data was difficult. Luckily for us, technology is making it easier to practice open science.
Another problem is that journal articles are essentially a summary of long months (and sometimes years) of research. This means that in the final report, some parts must be left out for the sake of brevity. Sometimes, the review process itself does not check the primary data and materials. Moreover, experience shows that papers with a clean story and clear results tend to be published more easily. This has pushed scientists to disregard studies that did not bring such results. In the end, this phenomenon causes an impoverishment of science, as other scientists could benefit from those “failed” attempts. Being fully transparent is thus also an opportunity to foster critical thinking, as real data can (very) often be chaotic.
Finally, another obstacle is the time constraint imposed by the current incentive structure. In the current academic world, publications are the easiest way for scientists to propel their careers forward. In this context, it is hard to dedicate more time to a process which may not bring direct benefits in terms of primary research articles. However, considering the bigger picture described above, open science brings advantages both to the individual researcher and to the scientific community.
The following describes principles and rules for open science practice as performed in our research group - Computational Ecosystem Science at ETH Zurich. The key concepts are general; the specific rules should be considered as examples.
Keep all code for one project in a dedicated project directory (e.g. ~/my_current_project/). All scripts that you use for the analysis go in here. Outputs from analyses or model runs go to a separate sub-directory data/. Figures may go to fig/.
Keep a workflow script in ~/my_current_project/ (or in a subdirectory analysis/) that documents all the steps - from reading in the original data, to data processing steps (document any decision regarding processing you made), to the final published plots and numbers. For bigger projects, you may keep separate analysis scripts (RMarkdown or Jupyter Notebooks) for different parts of the analysis, and you may add text describing which other scripts performed what part of the analysis in what order. For such bigger projects, it is important to document the main steps of the analysis and the respective scripts in the README.
Keeping things separate and in the right place is key. You may want to use the same code and data across multiple computers and possibly a remote server for High Performance Computing. It is therefore advisable to adopt an identical directory structure on each computer. This allows you to use identical relative paths (see here) in scripts that work across platforms, and enables easy syncing of files across platforms.
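As a minimal sketch (the project name my_current_project and the intermediate file name below are hypothetical), such a structure can be set up and used from R as follows:

```r
# Create the recommended sub-directories inside the project directory
# (run from within ~/my_current_project/; names follow the conventions above)
dir.create("data", showWarnings = FALSE)      # outputs from analyses or model runs
dir.create("fig", showWarnings = FALSE)       # figures
dir.create("analysis", showWarnings = FALSE)  # workflow and analysis scripts

# Because the directory structure is identical on every machine,
# scripts can use relative paths that work across platforms:
df <- read.csv(file.path("data", "intermediate_output.csv"))  # hypothetical file
```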
It is advisable to keep all git repositories at the top level of your working environment, e.g. in your home directory (~/my_repo1, ~/my_repo2, etc.). Code for your project (containing all steps from reading original data to final results and figures) goes into these and is synced with GitHub (or any other git remote hosting service, see Chapter ‘Git in Project Management’). Note that figures produced by your code and intermediate outputs should not be added to git.
You may want to collect any particularly useful functions that may be used in several different projects in your own personal utility package/library. An example is the rbeni package by Benjamin Stocker. Make sure it’s well documented.
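As a sketch of how such a personal package can be started (using the usethis package, assuming it is installed; the package and function names below are placeholders, not part of rbeni):

```r
library(usethis)

# Create a skeleton R package for your personal utility functions
# ("myutils" is a placeholder name)
create_package("~/myutils", open = FALSE)

# Point usethis at the new package and add a file for a documented function
proj_set("~/myutils")
use_r("calc_growth_rate")  # placeholder function name; document it with roxygen comments
```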
Original data, as downloaded from the web, goes into a dedicated data directory, which also sits at the top level of your working environment, e.g. in your home (~/data).
The project directory and other utility packages are synced with GitHub. Using git, the code of entire projects can be synced across multiple workstations and with your personal directory on your institution’s cluster or cloud services (e.g. the HPC cluster ‘Euler’ at ETH Zurich).
XXX to be added: the use of ~/.bash_profile
Original data, as downloaded from the web, should always be in the local data directory ~/data and synced with the remote server. Once you have downloaded data files from a particular source (e.g. accompanying a paper, or a particular satellite mission, etc.), create a sub-directory within ~/data and place them in there, along with a README file (plain text) where you specify when and how the data was obtained (contact person, URL, etc.), what the data contains (variables, units of variables if not provided in the files directly, etc.), what data use policies apply, who obtained the data (your name), and how it should be cited. If original data was processed to a different format or re-gridded, note this in the README and refer to the respective scripts, which may be added to the same directory (e.g., proc_this_data.sh) (but not any further analysis).
Note that intermediate outputs produced by your project’s code should not go on git. Git is only for backing up code (and small plain-text data files, on the order of <100 MB). Still, you may want to sync these intermediate outputs with your cluster as a backup of your analysis.
Making your analysis reproducible - from the original files you downloaded on the web to final results and figures - is our highest standard. No excuses! Before submission of a paper, you should be able to demonstrate reproducibility.
The key file here is a “workflow” script, written in RMarkdown or as a Jupyter Notebook, that runs all steps of your analysis in the right order (or at least contains accurate descriptions of how each step is run).
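A minimal sketch of such a workflow script, here written as a plain R script that runs the individual analysis scripts in order (all file names are placeholders for your own analysis steps):

```r
# workflow.R - run all steps of the analysis in the right order
# (script names are placeholders)
source("analysis/01_read_original_data.R")  # read the original data as downloaded
source("analysis/02_process_data.R")        # cleaning, aggregation, documented decisions
source("analysis/03_run_models.R")          # model runs / statistical analysis
source("analysis/04_make_figures.R")        # final published plots and numbers, saved to fig/
```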
A good habit is to copy figures produced by the workflow script to a separate directory synced with your cloud service (e.g. Polybox at ETH Zurich) that also contains the manuscript file. You may also keep this as yet another git project directory. Upon publication, all code and important outputs of the analysis are to be uploaded to a permanent repository and assigned a digital object identifier (DOI), e.g. on Zenodo (or any other service that generates DOIs). Regularly tag your git project at important stages (e.g. first submission, final submission, etc.). Tags can form a release on GitHub, and releases can be synced with Zenodo, where each release gets its own DOI (see chapter below). Refer to the code and output DOIs in your published paper.
Avoid writing data into R-specific binary formats (.RData or .rds), unless these are temporary files used for intermediate steps. All published data must be made accessible under FAIR principles. This means that data formats should be readable without reliance on proprietary software. In general, write data frames into CSV files and geo-located raster data into NetCDF or GeoTIFF. If you do save .RData or .rds files, put only one variable into them and name the file the same as the variable: e.g. the file df_combined.RData contains only one variable, namely df_combined.
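A minimal sketch of these format conventions (df_combined is assumed to be an existing data frame; the paths are placeholders):

```r
# Preferred: plain-text and standard formats readable without proprietary software
write.csv(df_combined, file = "data/df_combined.csv", row.names = FALSE)

# If R-specific binary files are used for intermediate steps, store one object
# per file and name the file after the object it contains:
save(df_combined, file = "data/df_combined.RData")   # contains only df_combined
saveRDS(df_combined, file = "data/df_combined.rds")  # single-object alternative
```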
The following provides an example that contains all steps of an open science workflow:
Obtain and document data
Organise analysis code in a git repository (and RStudio project)
Document analysis and communicate results
Sync analysis outputs with a remote server
Code versioning and publication
Below are examples of how to obtain public data. In general, always read the README carefully, especially the data-use policy, before further analysis and publication.
TRY database: https://www.try-db.org/TryWeb/Prop2.php It includes a wide range of leaf traits (mainly from natural environments). Follow the instructions below. You can find the trait list under “show traits” at the bottom left, where each number identifies a unique trait (e.g. 1 means leaf area). Enter all the trait numbers you want, then press continue to fill in the rest of the information. Please note: never request too many variables at once, as the resulting file could become too large to download.
Dryad: https://datadryad.org/search A public database to search, browse and download data, including many ecology, plant science and climate datasets at global and regional scales. Each dataset includes a DOI so you can cite it when using it in your publication.
Zenodo: https://zenodo.org/ Zenodo is more widely used than Dryad, as it covers all research areas rather than just ecological data. Many publications use it to deposit their data and obtain a unique DOI.
GitHub: an example here: https://github.com/forc-db/ForC Some research groups also deliver their data via GitHub. Although GitHub cannot produce DOIs directly (a DOI can only be generated in combination with Zenodo), its advantage is that it enables multiple data providers to update the data together, and it makes frequent updates convenient. This also has a side effect: such data are harder to cite, and there is a risk of using an old version when the data have recently been updated. So always keep in mind to git pull the repository when downloading this sort of data.
Other sources: many other public datasets used for forcing or modelling are already available on Euler, for example WFDEI, CRU… Always check Euler first to see whether the data already exist.
See this link for very good guidance: <https://alexd106.github.io/intro2R/project_setup.html>
What are the advantages of creating a project? It helps you manage the project: all analyses belonging to it are kept together and never get mixed up with other analyses.
Main steps:
Create a project
Then you will see the new .Rproj file in the window.
Then you can set your project working directory in RStudio, which is exactly the same location as shown in the window above, e.g. setwd("/Users/yunpeng/../first_project").
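Note that when the .Rproj file is opened, RStudio sets the working directory to the project root automatically, so relative paths (or the here package, if installed) can replace absolute setwd() calls. A minimal sketch (the input file name is a placeholder):

```r
# With the RStudio project open, the working directory is the project root,
# so relative paths work on any machine with the same directory structure:
df <- read.csv("data/my_site_data.csv")  # placeholder input file

# The 'here' package builds paths relative to the project root,
# even when scripts are run from sub-directories:
library(here)
df <- read.csv(here("data", "my_site_data.csv"))
```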
Below is an example of using the ingestr function ingest() to extract daily climate data (e.g. average temperature). The source data are available on Euler. To run the analysis, you can either download the data from Euler to your own hard drive and run the code locally, or run it directly on Euler. See the commented Euler paths in the code below.
```r
library(ingestr)
library(ggplot2)
library(tibble)  # for tibble()

# Site metadata: location, elevation, and the years to extract
siteinfo1 <- tibble(sitename = "site1", lon = -79.2, lat = 34.8, elv = 61,
                    year_start = 1992, year_end = 1992)

# Extract daily air temperature from WATCH-WFDEI, bias-corrected with WorldClim
df_watch <- ingest(
  siteinfo = siteinfo1,
  source   = "watch_wfdei",
  getvars  = c("temp"),
  dir      = "/Volumes/My Passport/data/watch_wfdei/",  # in Euler: /cluster/work/climate/bestocke/data/watch_wfdei/
  settings = list(correct_bias = "worldclim",
                  dir_bias = "/Volumes/My Passport/data/worldclim/"))  # in Euler: /cluster/work/climate/bestocke/data/worldclim/

# Plot the extracted daily temperature time series
ggplot(data = as.data.frame(df_watch$data), aes(x = date, y = temp)) +
  geom_point() +
  ggtitle("Daily air temperature at example site") +
  xlab("Date") + ylab("Temperature")
```
Getting public data from widely used portals (e.g. Copernicus data, NASA data, etc.) using their APIs
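Many of these portals require registration or an API key and provide their own download clients. As a generic sketch, a single file can be retrieved over HTTPS from R as follows (the URL and file names are placeholders, not a real endpoint):

```r
# Download one file from a (hypothetical) public data portal and store it
# under ~/data, following the conventions described above
url  <- "https://example.org/api/v1/data/my_dataset.nc"        # placeholder URL
dest <- file.path("~", "data", "my_dataset", "my_dataset.nc")  # placeholder path

dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
download.file(url, destfile = dest, mode = "wb")
```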
Here we show an example of how to ingest (download) climate data from Euler. We are interested in extracting air temperature for a given site and time range.
Guidance on installing ingestr is available here: https://github.com/computationales/ingestr
The source data are all available on Euler (see the comments in the code). You can either download the data from Euler to your local hard drive and run the code locally, or run it directly on Euler.
Using .gitattributes to keep data on git: <https://git-scm.com/book/en/v2/Customizing-Git-Git-Attributes#Binary-Files>
Git add: adds (stages) changed files on your current machine, moving the change from the working directory to the staging area.
Git commit: records the staged change together with a message in the local repository, ready to be shared.
Git push: pushes the change from the local repository to the remote repository; in other words, the change can now be shared with other users.
Git pull: pulls the latest updates of the remote repository into your current local working copy.
Git merge/checkout: merge combines local and remote (or branch) changes; checkout switches between branches or versions.
Git tag: creates a tag that points out an important release of the code. See the complete steps here: <https://stackoverflow.com/questions/18216991/create-a-tag-in-a-github-repository>
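The same basic cycle can also be scripted from within R using the gert package (a sketch, assuming gert is installed and the working directory is inside a git repository; the file name, message, and tag below are placeholders):

```r
library(gert)

git_add("analysis/my_script.R")       # stage a changed file (placeholder name)
git_commit("Update analysis script")  # commit the staged change with a message
git_push()                            # push the commit to the remote repository
git_pull()                            # fetch and merge the latest remote updates
git_tag_create("v1.0-submission", message = "First submission")  # tag a release
```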
See link: <https://docs.github.com/en/get-started/quickstart/create-a-repo>
First press “Create a repository” on GitHub and follow the instructions. Then follow the steps shown below: use “cd file-path” to move to the directory where your files are saved, then add the files, commit, and git push.
See link: <https://docs.github.com/en/get-started/quickstart/fork-a-repo>
First, find the repo you wish to fork, then fork it to your own GitHub account and clone it to your own desktop as shown below.
When using other users’ code, why do we prefer to fork a repo rather than using their repo directly (e.g. devtools::install_github("../.."))? Because:
Forking makes it easier to edit the code: we know exactly where the code is saved, and we can edit it, trace changes (through git), and push them to our own GitHub at any time.
It also helps us communicate with other users. For example, if you find an apparent mistake in the original branch, you can fix it in your own fork, push it to GitHub, and then report it to the original author.